Temporal Fusion







Query-based Temporal Fusion with Explicit Motion for 3D Object Detection

Neural Information Processing Systems

Existing methods conduct temporal fusion based on either dense BEV features or sparse 3D proposal features. However, the former does not focus attention on foreground objects, leading to higher computation cost and sub-optimal performance.



Coherent Online Road Topology Estimation and Reasoning with Standard-Definition Maps

Pham, Khanh Son, Witte, Christian, Behley, Jens, Betz, Johannes, Stachniss, Cyrill

arXiv.org Artificial Intelligence

Most autonomous cars rely on the availability of high-definition (HD) maps. Current research aims to address this constraint by directly predicting HD map elements from onboard sensors and reasoning about the relationships between the predicted map and traffic elements. Despite recent advancements, the coherent online construction of HD maps remains a challenging endeavor, as it necessitates modeling the high complexity of road topologies in a unified and consistent manner. To address this challenge, we propose a coherent approach to predict lane segments and their corresponding topology, as well as road boundaries, all by leveraging prior map information represented by commonly available standard-definition (SD) maps. We propose a network architecture that leverages hybrid lane segment encodings comprising prior information and denoising techniques to enhance training stability and performance. Furthermore, we leverage past frames for temporal consistency. Our experimental evaluation demonstrates that our approach outperforms previous methods by a significant margin, highlighting the benefits of our modeling scheme.


CRT-Fusion: Camera, Radar, Temporal Fusion Using Motion Information for 3D Object Detection

Neural Information Processing Systems

Accurate and robust 3D object detection is a critical component in autonomous vehicles and robotics. While recent radar-camera fusion methods have made significant progress by fusing information in the bird's-eye view (BEV) representation, they often struggle to effectively capture the motion of dynamic objects, leading to limited performance in real-world scenarios. In this paper, we introduce CRT-Fusion, a novel framework that integrates temporal information into radar-camera fusion to address this challenge. Our approach comprises three key modules: Multi-View Fusion (MVF), Motion Feature Estimator (MFE), and Motion Guided Temporal Fusion (MGTF). The MFE module conducts two simultaneous tasks: estimation of pixel-wise velocity information and BEV segmentation.
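As a rough illustration of the motion-guided idea described above (not CRT-Fusion's actual implementation), temporal fusion can be sketched as warping a previous BEV feature map toward the current frame using per-cell velocity estimates. All names, shapes, and the nearest-neighbor scatter below are assumptions for the sketch:

```python
import numpy as np

def motion_guided_warp(prev_bev, velocity, dt=0.5):
    """Warp a previous BEV feature map toward the current frame using
    per-cell velocity estimates (a toy stand-in for motion-guided fusion).

    prev_bev : (H, W, C) BEV features from the previous timestep
    velocity : (H, W, 2) per-cell (dy, dx) displacement in cells/second
    dt       : time gap between frames in seconds
    """
    H, W, _ = prev_bev.shape
    warped = np.zeros_like(prev_bev)
    for y in range(H):
        for x in range(W):
            # Move each cell's features along its estimated motion vector
            # (nearest-neighbor scatter; real systems use differentiable sampling).
            ny = int(round(y + velocity[y, x, 0] * dt))
            nx = int(round(x + velocity[y, x, 1] * dt))
            if 0 <= ny < H and 0 <= nx < W:
                warped[ny, nx] = prev_bev[y, x]
    return warped
```

With zero velocity the warp is the identity; a uniform velocity shifts the whole feature map, so features from moving objects line up with their current positions before fusion.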


CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

Ye, Zhangchen, Jiang, Tao, Xu, Chenfeng, Li, Yiming, Zhao, Hang

arXiv.org Artificial Intelligence

Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at https://github.com/Tsinghua-MARS-Lab/CVT-Occ.
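The sampling step described above can be sketched in 2D: for one voxel, take points along the ray from the camera to the voxel and gather features from a historical feature map. This is a toy sketch with assumed names and nearest-neighbor lookup, not CVT-Occ's actual cost-volume code:

```python
import numpy as np

def line_of_sight_samples(hist_feat, camera_xy, voxel_xy, num_samples=4):
    """For one voxel, sample points along the camera-to-voxel ray and
    gather features from a historical (BEV-like) feature map.

    hist_feat : (H, W, C) historical feature map
    camera_xy : (2,) camera position in grid coordinates
    voxel_xy  : (2,) voxel position in grid coordinates
    returns   : (num_samples, C) features forming one cost-volume column
    """
    H, W, C = hist_feat.shape
    samples = np.zeros((num_samples, C))
    for i in range(num_samples):
        t = (i + 1) / num_samples  # fractions of the ray, ending at the voxel
        p = camera_xy + t * (voxel_xy - camera_xy)
        y, x = int(round(p[0])), int(round(p[1]))
        if 0 <= y < H and 0 <= x < W:
            # Nearest-neighbor gather; a real system would interpolate.
            samples[i] = hist_feat[y, x]
    return samples
```

Stacking such columns over all voxels and several historical frames yields the cost volume; a learned head then uses the parallax-induced feature differences to refine the current occupancy features.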


StreamMOS: Streaming Moving Object Segmentation with Multi-View Perception and Dual-Span Memory

Li, Zhiheng, Cui, Yubo, Zhong, Jiexi, Fang, Zheng

arXiv.org Artificial Intelligence

Moving object segmentation based on LiDAR is a crucial and challenging task for autonomous driving and mobile robotics. Most approaches explore spatio-temporal information from LiDAR sequences to predict moving objects in the current frame. However, they often focus on transferring temporal cues in a single inference and regard every prediction as independent of others. This may lead to inconsistent segmentation results for the same object across different frames. To solve this issue, we propose a streaming network with a memory mechanism, called StreamMOS, to build the association of features and predictions among multiple inferences. Specifically, we utilize a short-term memory to convey historical features, which can be regarded as spatial priors of moving objects and are used to enhance current inference by temporal fusion. Meanwhile, we build a long-term memory to store previous predictions and exploit them to refine current forecasts at the voxel and instance levels through voting. Besides, we apply a multi-view encoder with cascaded projection and asymmetric convolution to extract motion features of objects in different representations. Extensive experiments validate that our algorithm achieves competitive performance on the SemanticKITTI and Sipailou Campus datasets.

In urban roads, there are often many dynamic objects with variable trajectories, such as vehicles and pedestrians, which create collision risks for autonomous vehicles. Meanwhile, these moving objects cause errors in simultaneous localization and mapping (SLAM) [1], as well as pose challenges for obstacle avoidance [2] and path planning [3].
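The dual-span memory with prediction voting described above can be sketched as follows. This is a hypothetical simplification (dictionary labels instead of voxel grids, simple majority voting), not StreamMOS's actual mechanism:

```python
from collections import Counter, deque

class DualSpanMemory:
    """Toy sketch of a dual-span memory: a short-term buffer of recent
    features (spatial priors) and a long-term store of past per-voxel
    predictions used to refine the current prediction by voting."""

    def __init__(self, short_span=2, long_span=5):
        self.features = deque(maxlen=short_span)     # short-term: feature priors
        self.predictions = deque(maxlen=long_span)   # long-term: label history

    def update(self, feat, pred):
        """Store the latest features and per-voxel labels {voxel: label}."""
        self.features.append(feat)
        self.predictions.append(dict(pred))

    def refine(self, current_pred):
        """Majority vote per voxel over stored predictions plus the current one."""
        refined = {}
        for voxel, label in current_pred.items():
            votes = Counter([label])
            for past in self.predictions:
                if voxel in past:
                    votes[past[voxel]] += 1
            refined[voxel] = votes.most_common(1)[0][0]
        return refined
```

If a voxel was labeled "moving" in most past frames, a single flickering "static" prediction is overruled, which is the kind of cross-frame consistency the memory is meant to provide.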